WRANGLERS

Which Countries are the Happiest?

Profiling

Photo by Alex Alvarez on Unsplash

World Happiness Report

The World Happiness Report has proven to be an indispensable tool for policymakers
looking to better understand what makes people happy…
— Jeffrey Sachs


Ingest

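library(readxl)     # provides read_xls()
library(tidyverse)  # dplyr, tidyr, purrr and ggplot2, used below
df <- read_xls('./archetypes/happiness-report/happiness-report-2020.xls')
df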

Dimensions

number of columns and rows

dim(df)
## [1] 1704   26

The output tells us that the data contains 1704 rows and 26 columns.

Type Detection

List keys and column types

glimpse(df)
## Rows: 1,704
## Columns: 26
## $ `Country name`                                             <chr> "Afghanista~
## $ Year                                                       <dbl> 2008, 2009,~
## $ `Life Ladder`                                              <dbl> 3.723590, 4~
## $ `Log GDP per capita`                                       <dbl> 7.168690, 7~
## $ `Social support`                                           <dbl> 0.4506623, ~
## $ `Healthy life expectancy at birth`                         <dbl> 50.80, 51.2~
## $ `Freedom to make life choices`                             <dbl> 0.7181143, ~
## $ Generosity                                                 <dbl> 0.177888572~
## $ `Perceptions of corruption`                                <dbl> 0.8816863, ~
## $ `Positive affect`                                          <dbl> 0.5176372, ~
## $ `Negative affect`                                          <dbl> 0.2581955, ~
## $ `Confidence in national government`                        <dbl> 0.6120721, ~
## $ `Democratic Quality`                                       <dbl> -1.92968965~
## $ `Delivery Quality`                                         <dbl> -1.6550844,~
## $ `Standard deviation of ladder by country-year`             <dbl> 1.774662, 1~
## $ `Standard deviation/Mean of ladder by country-year`        <dbl> 0.4765997, ~
## $ `GINI index (World Bank estimate)`                         <dbl> NA, NA, NA,~
## $ `GINI index (World Bank estimate), average 2000-16`        <dbl> NA, NA, NA,~
## $ `gini of household income reported in Gallup, by wp5-year` <dbl> NA, 0.44190~
## $ `Most people can be trusted, Gallup`                       <dbl> NA, 0.28631~
## $ `Most people can be trusted, WVS round 1981-1984`          <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 1989-1993`          <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 1994-1998`          <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 1999-2004`          <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 2005-2009`          <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 2010-2014`          <dbl> NA, NA, NA,~

The output above provides a little more information. For example, ‘Country name’ is a character column (chr), and all other columns are double-precision numbers (dbl). This already tells us that ‘Year’ does not have the right type: it identifies a survey year, so it should be treated as a category rather than as a quantity. We will fix it with the next command. Also notice the first values of each column, which help you get familiar with the data at hand. Some columns display many NA values, which indicate missing data.

df <- df %>% mutate(Year = as.factor(Year))
str(df$Year)
##  Factor w/ 14 levels "2005","2006",..: 4 5 6 7 8 9 10 11 12 13 ...

Column ‘Year’ is now a factor, the data type R uses for categorical variables. Levels are the distinct values the factor can take; here there are 14, one for each year present in the data.
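One consequence of the conversion is worth a quick sketch: a factor stores integer level codes internally, so converting it straight back to a number returns the codes rather than the years. The vector below is a hypothetical stand-in for the Year column, not the real data.

```r
# Hypothetical toy values standing in for the Year column
year_num <- c(2008, 2009, 2005, 2008)
year_fct <- as.factor(year_num)

# Levels are the sorted distinct values, stored as text
levels(year_fct)

# Converting directly yields the internal level codes, not the years
as.integer(year_fct)

# Going through character recovers the original numbers
year_back <- as.numeric(as.character(year_fct))
year_back
```

If the year is needed as a number later on (for example on a plot axis), the last conversion is the safe route.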

glimpse(df)
## Rows: 1,704
## Columns: 26
## $ `Country name`                                             <chr> "Afghanista~
## $ Year                                                       <fct> 2008, 2009,~
## $ `Life Ladder`                                              <dbl> 3.723590, 4~
## $ `Log GDP per capita`                                       <dbl> 7.168690, 7~
## $ `Social support`                                           <dbl> 0.4506623, ~
## $ `Healthy life expectancy at birth`                         <dbl> 50.80, 51.2~
## $ `Freedom to make life choices`                             <dbl> 0.7181143, ~
## $ Generosity                                                 <dbl> 0.177888572~
## $ `Perceptions of corruption`                                <dbl> 0.8816863, ~
## $ `Positive affect`                                          <dbl> 0.5176372, ~
## $ `Negative affect`                                          <dbl> 0.2581955, ~
## $ `Confidence in national government`                        <dbl> 0.6120721, ~
## $ `Democratic Quality`                                       <dbl> -1.92968965~
## $ `Delivery Quality`                                         <dbl> -1.6550844,~
## $ `Standard deviation of ladder by country-year`             <dbl> 1.774662, 1~
## $ `Standard deviation/Mean of ladder by country-year`        <dbl> 0.4765997, ~
## $ `GINI index (World Bank estimate)`                         <dbl> NA, NA, NA,~
## $ `GINI index (World Bank estimate), average 2000-16`        <dbl> NA, NA, NA,~
## $ `gini of household income reported in Gallup, by wp5-year` <dbl> NA, 0.44190~
## $ `Most people can be trusted, Gallup`                       <dbl> NA, 0.28631~
## $ `Most people can be trusted, WVS round 1981-1984`          <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 1989-1993`          <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 1994-1998`          <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 1999-2004`          <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 2005-2009`          <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 2010-2014`          <dbl> NA, NA, NA,~

Check for Missing Values

for each column, check for NAs

missing_stats <- purrr::map_df(df, ~ sum(is.na(.))) %>%
  gather('Column name', 'Count of missing values')

missing_stats

We now know that the first 3 columns have no missing values, but ‘Log GDP per capita’ has 28 missing values.
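Raw counts are easier to judge next to the share of rows they represent. The following is a minimal base-R sketch on a hypothetical toy data frame (the column names and values are made up for illustration); the same two calls, colSums() and colMeans(), should work on df as well.

```r
# Hypothetical toy data, not the happiness report
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  ladder  = c(3.7, NA, 5.1, 6.0),
  gdp     = c(NA, NA, 9.2, 10.1)
)

# is.na() gives a logical matrix; column sums count NAs, column means give their share
missing_count <- colSums(is.na(toy))
missing_share <- colMeans(is.na(toy))

missing_tbl <- data.frame(
  column      = names(missing_count),
  n_missing   = unname(missing_count),
  pct_missing = round(100 * unname(missing_share), 1)
)

# Most incomplete columns first
missing_tbl[order(-missing_tbl$n_missing), ]
```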

Distinct Values

by Country and Years

For Country Name:

distinct_df <- distinct(df,`Country name`) %>% arrange(`Country name`)
distinct_df

For Year:

distinct_df <- distinct(df, Year) %>% arrange(Year)
distinct_df
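Listing the distinct values can be complemented by counting them, and by checking how many distinct years each country contributes. This is a sketch in base R on a hypothetical toy data frame; the column names country and year are placeholders, not the names used in the report.

```r
# Hypothetical toy data: country A appears in two years, country B in one
toy <- data.frame(
  country = c("A", "A", "B", "A"),
  year    = c(2008, 2009, 2008, 2008)
)

# Number of distinct values in each column
n_countries <- length(unique(toy$country))
n_years     <- length(unique(toy$year))

# Number of distinct years observed per country
years_per_country <- tapply(toy$year, toy$country, function(y) length(unique(y)))
years_per_country
```

Countries with few observed years may deserve a closer look before comparing trends over time.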

Value Frequency

absolute and relative frequency for Year

df_1 <- table(df$Year)
df_2 <- as.data.frame(df_1) %>% 
  dplyr::rename(Year = Var1, Freq_absolute = Freq) %>% 
  mutate(Freq_relative=paste0(round(100*Freq_absolute/sum(Freq_absolute),digits=2),"%"))
df_2

For the year 2008, we have 110 records, which represents about 6.5% of the entire dataset.
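The relative frequency is simply the count divided by the total number of rows. The sketch below reproduces the 2008 figure from the two numbers quoted in the text, then shows prop.table() doing the same division on a hypothetical toy vector of years.

```r
# Figures quoted in the text: 110 records for 2008 out of 1704 rows
n_2008  <- 110
n_total <- 1704
pct_2008 <- round(100 * n_2008 / n_total, 2)
pct_2008  # 6.46

# Same idea on a hypothetical toy vector: prop.table() divides each count by the total
toy_years <- c(2005, 2008, 2008, 2009)
toy_freq  <- table(toy_years)
toy_share <- prop.table(toy_freq)
toy_share
```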

Positive, Negative and Zero Values

for all the numerical columns

df1 <- df[, 3:ncol(df)]  # numeric columns only (drops `Country name` and Year)

calcStats <- function(x) {
  temp <- na.omit(df1[[x]])    # column as a plain vector, NAs removed
  pos <- sum(temp > 0)
  is_zero <- sum(temp == 0)    # `==` compares; `=` would only name an argument
  neg <- sum(temp < 0)
  c("number of positives" = pos, "negatives" = neg, "zero" = is_zero)
}

result <- as.data.frame(Map(calcStats, colnames(df1)))
result
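Another way to obtain the same three counts is to classify each value with sign() and tabulate the result. This is only a sketch on a hypothetical toy vector, not the approach used above.

```r
# Hypothetical toy values, including one NA
x <- c(-1.5, 0, 2, NA, 3)

# sign() returns -1, 0 or 1; fixing the levels keeps empty classes in the table
signs <- factor(sign(x),
                levels = c(-1, 0, 1),
                labels = c("negative", "zero", "positive"))

# table() drops NA by default, so missing values are not counted
counts <- table(signs)
counts
```

Fixing the factor levels up front means a column with no negative values still reports a count of 0 for that class.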

Histograms

for all numerical variables

df_long <- df %>%
  pivot_longer(
    `Life Ladder`:`Most people can be trusted, WVS round 2010-2014`,
    names_to = "measure",
    values_to = "value"
  )

v1 <- ggplot(df_long, aes(x=value)) +
  geom_histogram(fill = "#79B8E5") +
  facet_wrap(~ measure, scales="free")+
  theme(panel.grid = element_blank(), 
        strip.background = element_blank(),
        panel.background = element_blank()
 ) 

library(ggiraph)  # provides girafe() and opts_sizing()
girafe(ggobj = v1, width_svg = 16, height_svg = 9, options =
  list(opts_sizing(rescale = TRUE, width = 1.0))
)

As we will see later in the course, the two variables of interest are “Life Ladder” and “Log GDP per capita”. Judging from the histograms, “Life Ladder” ranges from 0 to 8, with most values between 4 and 6. “Log GDP per capita” ranges from about 5 to 12. Because this column stores the natural logarithm of GDP per capita, GDP per person in this dataset ranges from roughly $150 (e^5) to roughly $160k (e^12).
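The dollar range follows from undoing the logarithm with exp(). A short check, assuming the approximate bounds of 5 and 12 read off the histogram:

```r
# Approximate bounds of `Log GDP per capita`, read off the histogram
log_gdp_range <- c(5, 12)

# exp() inverts the natural logarithm
gdp_range <- exp(log_gdp_range)
round(gdp_range)  # 148 162755
```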

References

citations for narrative and data sources:

  • Narrative and Data sources: The 7th World Happiness Report
@misc{helliwell_2019_world,
  author = {Helliwell, John F. and Layard, Richard and Sachs, Jeffrey D.},
  title = {World Happiness Report 2019},
  url = {https://worldhappiness.report/ed/2019/},
  urldate = {2021-05-18},
  year = {2019},
  organization = {Worldhappiness.report}
}